Red Wines by Eileen Hertwig

This report explores a dataset containing information about different characteristics of red wines. The data contains 1599 observations of 13 variables.

Univariate Plots Section

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

A summary of the dataset gives a quick overview.

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The objective of this dataset is to measure the quality of red wine, which is assessed based on sensory data (median of at least 3 evaluations made by wine experts). The scale ranges from 0 (worst) to 10 (best). The histogram and the table of the quality ratings above show that most wines are rated in the middle of the scale (5 or 6) and that the very good and very bad ratings are missing completely.

The input variables that could possibly influence the quality of the red wines are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. Histograms of all 11 variables are shown above to give an overview.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

It is a bit surprising (for me) that the alcohol content ranges from 8.4 to 14.9%, so I will look at the distribution in more detail.

The alcohol content varies a lot. While half of the wines have alcohol contents of 10.2% or below (mean and median are indicated by blue and red lines, respectively), the distribution is right-skewed and there are many red wines with much higher alcohol contents. I wonder if this quantity is related to the quality of the wine.

## 
##  0.9  1.2  1.3  1.4  1.5  1.6 1.65  1.7 1.75  1.8  1.9    2 2.05  2.1 2.15 
##    2    8    5   35   30   58    2   76    2  129  117  156    2  128    2 
##  2.2 2.25  2.3 2.35  2.4  2.5 2.55  2.6 2.65  2.7  2.8 2.85  2.9 2.95    3 
##  131    1  109    1   86   84    1   79    1   39   49    1   24    1   25 
##  3.1  3.2  3.3  3.4 3.45  3.5  3.6 3.65  3.7 3.75  3.8  3.9    4  4.1  4.2 
##    7   15   11   15    1    2    8    1    4    1    8    6   11    6    5 
## 4.25  4.3  4.4  4.5  4.6 4.65  4.7  4.8    5  5.1 5.15  5.2  5.4  5.5  5.6 
##    1    8    4    4    6    2    1    3    1    5    1    3    1    8    6 
##  5.7  5.8  5.9    6  6.1  6.2  6.3  6.4 6.55  6.6  6.7    7  7.2  7.3  7.5 
##    1    4    3    4    4    3    2    3    2    2    2    1    1    1    1 
##  7.8  7.9  8.1  8.3  8.6  8.8  8.9    9 10.7   11 12.9 13.4 13.8 13.9 15.4 
##    2    3    2    3    1    2    1    1    1    2    1    1    2    1    2 
## 15.5 
##    1

It seems that there are quite a few outliers on the higher end of residual sugar, so I will zoom in.

Disregarding the outliers on the higher end, residual sugar is almost normally distributed around a value close to 2 (the median is 2.2).

The distribution of fixed acidity is right tailed. Most wines have values around 7-8 g/dm³, but some wines have much higher fixed acidity.

Volatile acidity is almost normally distributed when neglecting the highest 1% of the values (the 99th percentile is indicated by a red line).

## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1

The most common citric acid value is 0 g/dm³ (132 wines), although the median is 0.26 g/dm³. There are also red wines in the dataset with much higher values, including one outlier with 1 g/dm³ citric acid.

Most red wines contain very little sodium chloride (between 0.05 and 0.10 g/dm³). The full dataset (without zooming in), however, contains a very small number of wines with salt levels up to 0.6 g/dm³.

Free sulfur dioxide prevents microbial growth and the oxidation of wine. Most red wines have levels below 40 mg/dm³, but the mode of the distribution is found to be much smaller (around 5 mg/dm³).

In low concentrations total sulfur dioxide should not influence the quality of the wine, because “in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine” (from the data documentation). In the above plot the limit of 50 ppm (=50 mg/dm³) is indicated by a green line, showing that while many wines in the dataset have values below it, some have much higher levels of total sulfur dioxide. Note that the documented limit refers to free SO2, whereas the plot shows total SO2.

The density of the wines is close to that of pure water (=1 g/cm³, indicated by the blue dashed line), but the mean (red solid line) is slightly lower. This is probably due to the alcohol, which is lighter than water. A small number of wines are denser than pure water. These could be the ones that contain a lot of residual sugar.

The pH value describes the acidity of the red wines. The distribution is almost normal with most values between 3.0 and 3.7.

Most wines have relatively low sulphates levels, but there are a few outliers at the higher end, so I zoom in excluding the top 1%.

Univariate Analysis

What is the structure of your dataset?

The red wines dataset contains 1599 observations of 13 variables. The first variable is an id number. The following 11 describe properties of the red wines: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. The last variable is the output, the quality of the red wine. The quality is based on sensory data (median of at least 3 evaluations made by wine experts). The scale ranges from 0 (worst) to 10 (best).

What is/are the main feature(s) of interest in your dataset?

The main feature in this dataset is the quality of the red wines. However, most red wines have been rated with medium marks. Very good and very bad ratings are missing in this data set.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I suspect alcohol will influence the taste of the wine, but also other chemical properties might be important. A bi- or multivariate exploration of the variables is needed to find out.

Did you create any new variables from existing variables in the dataset?

In the bivariate analysis section I found that because most ratings are in the medium range, I needed to create quality groups of “bad wines” (ratings of 3 and 4), “medium wines” (ratings of 5 and 6), and “good wines” (ratings of 7 and 8).

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

There are quite a few outliers on the high end for residual sugar content. Therefore, when looking at this quantity I exclude the top 5%. For other quantities it is also often convenient to ignore the highest and lowest 1%.
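The trimming step can be sketched in a few lines. This is only an illustration in Python, not the R code behind this report: the helper names and the sample values are hypothetical, and it uses a simple nearest-rank percentile, whereas R's `quantile()` interpolates by default and can give a slightly different cutoff.

```python
import math

def nearest_rank_percentile(values, pct):
    """Return the nearest-rank percentile (pct between 0 and 100) of values."""
    ordered = sorted(values)
    # Nearest-rank definition: smallest value with at least pct% of data at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def drop_above_percentile(values, pct):
    """Keep only the values at or below the given percentile."""
    cutoff = nearest_rank_percentile(values, pct)
    return [v for v in values if v <= cutoff]

# Hypothetical residual sugar values, NOT the real dataset.
residual_sugar = [1.9, 2.0, 2.1, 2.2, 2.2, 2.3, 2.4, 2.5, 2.6, 2.6,
                  2.0, 1.8, 2.1, 2.3, 2.5, 1.7, 2.2, 2.4, 9.0, 15.5]

# Excluding the top 5% removes the single most extreme value here.
trimmed = drop_above_percentile(residual_sugar, 95)
```

With 20 values, the 95th percentile cutoff is the 19th smallest value, so only the largest observation is dropped. The same helper with `pct=99` would mimic the 1% trimming used for other variables.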

Bivariate Plots Section

Alcohol content and quality seem to be related (higher alcohol seems to go along with a higher quality), but I am not able to find a clear linear trend. Using techniques to avoid overplotting helps to get a clearer picture, but still does not show me a clear linear relationship. I will have to look at further variables.

At first glance residual sugar does not seem to have a strong relationship with the quality of the wines. However, most values of residual sugar are below 4 with some very high outliers, so I will remove the top 5%.

Even zooming in does not show me a clear relationship between residual sugar and the quality of the red wines. I will move on to other variables.

From this scatterplot a linear relationship between quality and fixed acidity is not obvious.

There seems to be an increase in quality with a decreasing volatile acidity. This is reasonable, since high values of volatile acidity lead to a vinegar-like taste. I will have to explore this further.

There seems to be a positive relationship between quality and citric acid, but because most data points are in the quality ratings of 5 and 6, it is hard to see a real linear relationship from this plot.

With zooming in (excluding the top 3% and bottom 1% of data points), there seems to be a slight negative relationship between quality and chlorides.

A relationship between free sulfur dioxide and quality cannot be found from this scatterplot.

Excluding the top 1% of data points there is a (very) slight hint of a negative relationship between quality and total sulfur dioxide, but this scatterplot alone is not convincing enough.

A small negative correlation between quality and density can be seen here. This can, however, be related to the fact that density is strongly controlled by the alcohol content (which I suspect and will explore later on) and alcohol is related to the quality.

Even when zooming in (excluding the top and lowest 1%), a relationship between pH and the quality cannot be found from this scatterplot.

When excluding the top 1% of data points, a positive linear relationship between quality and sulphates can be seen in the above scatterplot.

Looking at scatterplots of the other chemical properties vs. quality I can see some positive and some negative relationships but mostly it is hard to say, because most wines received ratings of 5 and 6. I wonder if it might be better to just compare good wines (ratings of 7 and 8) to bad wines (ratings of 3 and 4).

##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :  19.0   Min.   : 4.600   Min.   :0.2300   Min.   :0.0000  
##  1st Qu.: 435.0   1st Qu.: 6.800   1st Qu.:0.5650   1st Qu.:0.0200  
##  Median : 834.0   Median : 7.500   Median :0.6800   Median :0.0800  
##  Mean   : 837.7   Mean   : 7.871   Mean   :0.7242   Mean   :0.1737  
##  3rd Qu.:1285.5   3rd Qu.: 8.400   3rd Qu.:0.8825   3rd Qu.:0.2700  
##  Max.   :1522.0   Max.   :12.500   Max.   :1.5800   Max.   :1.0000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 1.200   Min.   :0.04500   Min.   : 3.00      
##  1st Qu.: 1.900   1st Qu.:0.06850   1st Qu.: 5.00      
##  Median : 2.100   Median :0.08000   Median : 9.00      
##  Mean   : 2.685   Mean   :0.09573   Mean   :12.06      
##  3rd Qu.: 2.950   3rd Qu.:0.09450   3rd Qu.:15.50      
##  Max.   :12.900   Max.   :0.61000   Max.   :41.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  7.00       Min.   :0.9934   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 13.50       1st Qu.:0.9957   1st Qu.:3.300   1st Qu.:0.4950  
##  Median : 26.00       Median :0.9966   Median :3.380   Median :0.5600  
##  Mean   : 34.44       Mean   :0.9967   Mean   :3.384   Mean   :0.5922  
##  3rd Qu.: 48.00       3rd Qu.:0.9977   3rd Qu.:3.500   3rd Qu.:0.6000  
##  Max.   :119.00       Max.   :1.0010   Max.   :3.900   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.60   1st Qu.:4.000  
##  Median :10.00   Median :4.000  
##  Mean   :10.22   Mean   :3.841  
##  3rd Qu.:11.00   3rd Qu.:4.000  
##  Max.   :13.10   Max.   :4.000
## [1] 63 13

The above shows the summary for the “bad wines”, i.e. quality ratings of 3 and 4. 63 wines of the original dataset are considered “bad wines”.

##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1.0   Min.   : 4.700   Min.   :0.1600   Min.   :0.0000  
##  1st Qu.: 382.5   1st Qu.: 7.100   1st Qu.:0.4100   1st Qu.:0.0900  
##  Median : 768.0   Median : 7.800   Median :0.5400   Median :0.2400  
##  Mean   : 793.0   Mean   : 8.254   Mean   :0.5386   Mean   :0.2583  
##  3rd Qu.:1219.5   3rd Qu.: 9.100   3rd Qu.:0.6400   3rd Qu.:0.4000  
##  Max.   :1599.0   Max.   :15.900   Max.   :1.3300   Max.   :0.7900  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.03400   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07100   1st Qu.: 8.00      
##  Median : 2.200   Median :0.08000   Median :14.00      
##  Mean   : 2.504   Mean   :0.08897   Mean   :16.37      
##  3rd Qu.: 2.600   3rd Qu.:0.09100   3rd Qu.:22.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.860   Min.   :0.3700  
##  1st Qu.: 24.00       1st Qu.:0.9958   1st Qu.:3.210   1st Qu.:0.5400  
##  Median : 40.00       Median :0.9968   Median :3.310   Median :0.6100  
##  Mean   : 48.95       Mean   :0.9969   Mean   :3.311   Mean   :0.6473  
##  3rd Qu.: 65.00       3rd Qu.:0.9979   3rd Qu.:3.400   3rd Qu.:0.7000  
##  Max.   :165.00       Max.   :1.0037   Max.   :4.010   Max.   :1.9800  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :5.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.00   Median :5.000  
##  Mean   :10.25   Mean   :5.484  
##  3rd Qu.:10.90   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :6.000
## [1] 1319   13

The above summary is for the “medium wines” with ratings of 5 and 6. 1319 wines belong to this group.

##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   8.0   Min.   : 4.900   Min.   :0.1200   Min.   :0.0000  
##  1st Qu.: 482.0   1st Qu.: 7.400   1st Qu.:0.3000   1st Qu.:0.3000  
##  Median : 939.0   Median : 8.700   Median :0.3700   Median :0.4000  
##  Mean   : 831.7   Mean   : 8.847   Mean   :0.4055   Mean   :0.3765  
##  3rd Qu.:1089.0   3rd Qu.:10.100   3rd Qu.:0.4900   3rd Qu.:0.4900  
##  Max.   :1585.0   Max.   :15.600   Max.   :0.9150   Max.   :0.7600  
##  residual.sugar    chlorides       free.sulfur.dioxide
##  Min.   :1.200   Min.   :0.01200   Min.   : 3.00      
##  1st Qu.:2.000   1st Qu.:0.06200   1st Qu.: 6.00      
##  Median :2.300   Median :0.07300   Median :11.00      
##  Mean   :2.709   Mean   :0.07591   Mean   :13.98      
##  3rd Qu.:2.700   3rd Qu.:0.08500   3rd Qu.:18.00      
##  Max.   :8.900   Max.   :0.35800   Max.   :54.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  7.00       Min.   :0.9906   Min.   :2.880   Min.   :0.3900  
##  1st Qu.: 17.00       1st Qu.:0.9947   1st Qu.:3.200   1st Qu.:0.6500  
##  Median : 27.00       Median :0.9957   Median :3.270   Median :0.7400  
##  Mean   : 34.89       Mean   :0.9960   Mean   :3.289   Mean   :0.7435  
##  3rd Qu.: 43.00       3rd Qu.:0.9973   3rd Qu.:3.380   3rd Qu.:0.8200  
##  Max.   :289.00       Max.   :1.0032   Max.   :3.780   Max.   :1.3600  
##     alcohol         quality     
##  Min.   : 9.20   Min.   :7.000  
##  1st Qu.:10.80   1st Qu.:7.000  
##  Median :11.60   Median :7.000  
##  Mean   :11.52   Mean   :7.083  
##  3rd Qu.:12.20   3rd Qu.:7.000  
##  Max.   :14.00   Max.   :8.000
## [1] 217  13

A summary for “good wines” with ratings of 7 or 8 is shown above. 217 red wines of the original sample are considered “good”.

## 
## (2,4] (4,6] (6,8] 
##    63  1319   217

Earlier I created subsets of the original dataframe to look at the summaries; for the plots, I have created a new variable “quality.bucket”.

The boxplots reveal that the median of fixed acidity increases from bad to medium to good wines. The IQR also gets larger, but comparing the spread might not be very meaningful considering the very different sample sizes in the three groups.
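The bucket counts above follow directly from the rating table at the start of the report. As a hedged sketch (written in Python for illustration, not the R code used for this report; the function name `quality_bucket` is hypothetical), the half-open intervals (2,4], (4,6] and (6,8] can be applied to the rating counts like this:

```python
# Rating counts copied from the quality table shown earlier in the report.
rating_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

def quality_bucket(rating):
    """Map a rating to its interval label: open on the left, closed on the right."""
    if 2 < rating <= 4:
        return "(2,4]"   # "bad" wines
    if 4 < rating <= 6:
        return "(4,6]"   # "medium" wines
    if 6 < rating <= 8:
        return "(6,8]"   # "good" wines
    raise ValueError(f"rating {rating} is outside the observed range")

# Sum the rating counts within each bucket.
bucket_counts = {}
for rating, n in rating_counts.items():
    label = quality_bucket(rating)
    bucket_counts[label] = bucket_counts.get(label, 0) + n

# Share of all wines that fall into the medium bucket.
medium_share = bucket_counts["(4,6]"] / sum(rating_counts.values())
```

The medium bucket holds 1319 of the 1599 wines, which is why the scatterplots against the raw ratings are dominated by the ratings 5 and 6.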

For volatile acidity the scatterplots already showed a negative linear relationship to quality. The boxplots for the three quality groups confirm this.

While the scatterplot of quality vs. citric acid was hard to read, the boxplots show a clear increase in quality with citric acid.

Even when excluding the top 5%, a relationship between quality and residual sugar can still not be determined by the boxplots.

“Bad” wines have higher chloride levels than “good” wines, but the IQRs overlap almost entirely and the difference is not great. A real relationship cannot be determined from this plot.

Just like the scatterplot, the boxplots do not show a linear relationship between quality and free sulfur dioxide.

The scatterplot gave a hint of a negative linear relationship between quality and total sulfur dioxide, but the boxplots do not confirm this. “Bad” wines have median levels of total sulfur dioxide that are about the same as for “good” wines. Only “medium” wines have higher levels.

“Good” wines have a lower median density than “bad” or “medium” wines, but a linear relationship between quality and density cannot be found from the boxplots.

The boxplots reveal a slight negative relationship between quality and pH, but since the IQRs overlap almost entirely for “medium” and “good” wines, this does not seem to be very significant.

The positive linear relationship between quality and sulphates found in the scatterplot can be confirmed by the boxplots.

While alcohol content seems to be very important for the high quality wines (“good” wines have much higher alcohol content), it does not significantly differ between “bad” and “medium” wines.

Now looking at buckets of quality ratings, the picture is much clearer than just from the scatterplots alone.

There seems to be a positive linear relationship between quality and fixed acidity as well as citric acid and sulphates. For alcohol content the grouping into quality groups shows that “bad” and “medium” red wines have on average the same alcohol content (even though medium wines have more outliers in the positive direction), but good wines have a much higher alcohol content. This does not seem to be a strictly linear relationship though.

A negative linear relationship to quality can be found for volatile acidity and the pH value.

The other variables do not show a linear relationship to quality even when grouping quality ratings.

To complete the picture, I will compute Pearson’s correlation coefficients for all variables vs quality.

## 
##  Pearson's product-moment correlation
## 
## data:  quality and fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516
## 
##  Pearson's product-moment correlation
## 
## data:  quality and volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## 
##  Pearson's product-moment correlation
## 
## data:  quality and citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
## 
##  Pearson's product-moment correlation
## 
## data:  quality and residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164
## 
##  Pearson's product-moment correlation
## 
## data:  quality and chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066
## 
##  Pearson's product-moment correlation
## 
## data:  quality and free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606
## 
##  Pearson's product-moment correlation
## 
## data:  quality and total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003
## 
##  Pearson's product-moment correlation
## 
## data:  quality and density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192
## 
##  Pearson's product-moment correlation
## 
## data:  quality and pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139
## 
##  Pearson's product-moment correlation
## 
## data:  quality and sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
## 
##  Pearson's product-moment correlation
## 
## data:  quality and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

The Pearson’s correlation between the quality (now again the original dataset) and the variables sometimes gives a different picture than the plots. For example, from the correlation coefficient it appears that alcohol content has the strongest correlation to the quality of the red wine.
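The t statistics in the output above are tied to the correlation coefficients by the standard relation t = r·sqrt(df / (1 − r²)) with df = n − 2. The following Python sketch (an illustration only, not the R code behind this report; the function name is hypothetical) recomputes two of them from the reported coefficients:

```python
import math

def t_from_r(r, n):
    """t statistic for testing that a Pearson correlation r from n pairs is zero."""
    df = n - 2
    return r * math.sqrt(df / (1 - r * r))

N_WINES = 1599  # number of observations, so df = 1597 as in the output

# Coefficients copied from the correlation output above.
t_alcohol = t_from_r(0.4761663, N_WINES)   # quality vs alcohol
t_sugar = t_from_r(0.01373164, N_WINES)    # quality vs residual sugar
```

With 1597 degrees of freedom even small coefficients produce large t values, which is why weak correlations such as the one for pH still come out with p-values below 0.05. Statistical significance here says little about the strength of a relationship.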

Fixed and volatile acidity do not seem to be linearly related. However, if both are plotted (individually) against citric acid linear relationships can be observed (positive for fixed and negative for volatile acidity).

## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and volatile.acidity
## t = -10.589, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3013681 -0.2097433
## sample estimates:
##        cor 
## -0.2561309
## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034
## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

The Pearson’s correlation coefficients confirm these relationships. However, it appears that fixed and volatile acidity are also negatively linearly related (with a rather weak correlation coefficient, though), even though this is not obvious from the plot.

## 
##  Pearson's product-moment correlation
## 
## data:  density and residual.sugar
## t = 15.189, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3116908 0.3973835
## sample estimates:
##       cor 
## 0.3552834
## 
##  Pearson's product-moment correlation
## 
## data:  density and alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

The density of red wines is very close to the density of water (1 g/cm³), but decreases with increasing alcohol content and increases with increasing residual sugar content.
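The 95 percent confidence intervals in the output can be reproduced with the Fisher z transform, which, as far as I know, is what R's `cor.test` uses for Pearson correlations. The sketch below is in Python purely for illustration (the function name `fisher_ci` is hypothetical) and takes the reported coefficients as input:

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation via Fisher's z."""
    z = math.atanh(r)                  # transform r to the z scale
    se = 1 / math.sqrt(n - 3)          # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

N_WINES = 1599

# Coefficients copied from the two correlation tests above.
ci_density_sugar = fisher_ci(0.3552834, N_WINES)
ci_density_alcohol = fisher_ci(-0.4961798, N_WINES)
```

Neither interval comes close to zero, which supports reading both relationships as real: density rises with residual sugar and falls with alcohol.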

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Most wines had been rated with a quality mark of 5 or 6, with far fewer wines receiving ratings of 3, 4, 7, or 8. Therefore, to really see a linear relationship, it was necessary to group the ratings of the red wines into three categories and compare boxplots for the three groups.

Some but not all of the variables in this dataset show a clear relationship to the quality of the red wines. Positive linear relationships can be observed for fixed acidity, citric acid and sulphates. Alcohol content has the highest correlation coefficient (r = 0.48), but the relationship does not look strictly linear in the boxplots. Residual sugar shows no significant correlation with quality (r = 0.01, p = 0.58). Volatile acidity has a negative linear relationship to the quality of the red wines, and the pH value a much weaker negative one (r = -0.06).

The other variables do not show a linear relationship, but that does not mean that they are completely unrelated to the quality of the wine.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The density of the red wines is determined mostly by the alcohol content and the residual sugar. It decreases with increasing alcohol content and increases with increasing residual sugar content.

Fixed and volatile acidity seem to be linearly related to citric acid (and weakly to each other). However, the relationships are not strong enough for one of these variables to substitute for the other two completely when looking at the quality of the red wines.

What was the strongest relationship you found?

According to the plot and the Pearson’s correlation coefficient (r = -0.39), volatile acidity has the strongest negative relationship to the quality of the red wine. That makes sense, as the description says: “volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste”.

The magnitude of the correlation coefficient for alcohol was even higher (r = 0.48), but here the relationship does not seem to be strictly linear, as the difference between “bad” and “medium” wines is very small. Only for “good” wines is the alcohol content much higher.

Multivariate Plots Section

This plot confirms again that the density of the wines increases with increasing residual sugar and decreasing alcohol. This multivariate plot combines these two facts so that only one plot is necessary instead of two.

In the bivariate analysis I found that volatile acidity and alcohol have the strongest correlation coefficients. The plot above confirms again that both are strongly related to the quality of the red wines; for alcohol, however, the relationship does not seem to be linear.

## # A tibble: 6 x 9
##   quality alcohol_mean volatile.acidit… citric.acid_mean fixed.acidity_m…
##     <int>        <dbl>            <dbl>            <dbl>            <dbl>
## 1       3         9.96            0.884            0.171             8.36
## 2       4        10.3             0.694            0.174             7.78
## 3       5         9.90            0.577            0.244             8.17
## 4       6        10.6             0.497            0.274             8.35
## 5       7        11.5             0.404            0.375             8.87
## 6       8        12.1             0.423            0.391             8.57
## # … with 4 more variables: sulphates_mean <dbl>, pH_mean <dbl>,
## #   residual.sugar_mean <dbl>, n <int>

A new dataframe gives me the mean of the important variables grouped by the quality ratings. A count (n) of how many members each group holds is also included.
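As a quick consistency check on this table, the per-rating means can be recombined into group means by weighting each rating with its count. This is a Python sketch for illustration only, not the R code behind the report. It assumes that the truncated `n` column equals the rating counts from the quality table at the start of the report, and it uses the rounded means printed above, so the results are approximate.

```python
# Mean alcohol content per rating, copied (rounded) from the tibble above.
alcohol_mean = {3: 9.96, 4: 10.3, 5: 9.90, 6: 10.6, 7: 11.5, 8: 12.1}
# Assumed group sizes: the rating counts from the quality table.
rating_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

def weighted_mean(ratings):
    """Count-weighted mean alcohol content over the given ratings."""
    total = sum(rating_counts[r] for r in ratings)
    return sum(alcohol_mean[r] * rating_counts[r] for r in ratings) / total

overall = weighted_mean([3, 4, 5, 6, 7, 8])
bad = weighted_mean([3, 4])
medium = weighted_mean([5, 6])
good = weighted_mean([7, 8])
```

Up to rounding, the overall value agrees with the mean alcohol content of 10.42 from the first summary, and the group values show the pattern described in the boxplots: “bad” and “medium” wines are almost identical, while “good” wines are more than one percentage point higher.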

Plotting the means of alcohol and volatile acidity for each quality rating together confirms again that the quality increases with increasing alcohol (especially for the higher quality wines) and with decreasing volatile acidity (especially for the lower and medium quality wines).

The scatterplot reveals the positive relationship citric acid has with the quality of the red wines. However, adding the pH value, it appears that pH is more related to citric acid than to the quality of the wines. Looking at the line graphs that show the means for each rating reveals why: for the medium rated wines (which is the biggest group) the pH value does not seem to vary much. Only the rather bad wines differ noticeably from the better wines by having a higher pH value.

The positive linear relationship between quality and sulphates as well as fixed acidity can be seen in the above scatterplot (even though it is less obvious for fixed acidity).

The lineplot for the means of each quality group confirms that there is a clear positive linear relationship between the quality and sulphates and a less pronounced but still visible positive relationship between quality and fixed acidity.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The two variables that are most strongly related to the quality of the red wines (in terms of correlation coefficient) are alcohol and volatile acidity. A combined scatterplot confirms these relationships. Even when holding levels of volatile acidity constant, alcohol content is higher in the higher quality wines. However, volatile acidity and alcohol do not seem to be totally unrelated to each other, since there is more often higher alcohol content to be found in wines with lower than in wines with higher volatile acidity.

While citric acid has a relatively strong positive linear relationship with the quality of the red wines, the pH value (which is related to citric acid) shows a much weaker relationship.

Fixed acidity and sulphates are both related to the quality of the wines, but when plotting both together the plot is no longer easy to read, probably because the linear relationships are too weak.

Were there any interesting or surprising interactions between features?

Looking at it from different angles, I could still not find a meaningful relationship between residual sugar and the quality of the wines.

The pH value is so strongly related to citric acid that holding citric acid constant, the variation in pH is not very strong and a relationship to the quality cannot be determined anymore.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

The above plot is essentially the scatterplot from the bivariate section showing the alcohol content by the quality rating. It has been enhanced with a color scale for the quality groups I have created (“bad”, “medium”, and “good” wines), and a line showing the mean alcohol content for each rating (like in the multivariate plotting section) has also been added.

This plot shows clearly that wines that receive higher ratings have on average higher alcohol contents. It also shows that most wines received ratings in the middle of the scale, with far fewer ratings for “bad” and “good” wines.

Plot Two

Description Two

The above plot shows boxplots for volatile acidity, which, next to alcohol, was found to be the variable with the strongest relationship to the quality of the wines. The boxplots are grouped into “bad wines” (ratings 3 and 4), “medium wines” (ratings 5 and 6) and “good wines” (ratings 7 and 8). It is essentially the same plot as the one from the bivariate section, but a line showing the mean volatile acidity for each quality rating is added.

Decreasing levels of volatile acidity are related to a higher quality rating. Only for the very high ratings a difference cannot be found anymore. That makes sense since high levels of volatile acidity give wine a vinegar-like taste. If the level of volatile acidity is already relatively low, it should not influence the quality anymore.

Plot Three

Description Three

Plotting volatile acidity and alcohol together against the quality of the red wines confirms the bivariate relationships again. Even when holding levels of volatile acidity constant, alcohol content is higher in the higher quality wines. However, volatile acidity and alcohol do not seem to be totally unrelated: higher alcohol content is found more often in wines with lower volatile acidity than in wines with higher volatile acidity.


Reflection

The red wine dataset contains 11 variables that describe properties of wine as well as a quality rating. I started out looking at a histogram of the ratings and found that most wines had been rated in the medium (5 or 6) range, without any ratings at the very low or very high end.

I have also looked at the 11 variables to see how they are distributed. Trying to find out which properties of the red wines influence the quality the most, I have looked into (mostly linear) relationships. I found that the quality of red wines seems to improve with increasing alcohol, fixed acidity, citric acid, and sulphates and with decreasing volatile acidity and pH value. For residual sugar the correlation with quality was too weak (r = 0.01, p = 0.58) to indicate a relationship.

I have tried to learn about the chemical properties of red wine to explore this dataset, but this is no substitute for real domain knowledge, which would have helped me to make this analysis more meaningful and would have enabled me to ask more interesting questions.

Since most of the wines had ratings of 5 or 6 and only very few of 3 and 4 or 7 and 8 (and no ratings at the lower or upper ends at all), it was not easy to find relationships. I had to use different ways to group the wines into quality groups to work with the data. A more diverse dataset containing data on wines from the entire quality spectrum (ideally evenly distributed) would make the exploration easier and would probably lead to more meaningful results. With such a dataset I would have tried to build a model to predict the quality of the red wines.